AirBnB - A Sentiment Analysis

Introduction

What is Airbnb and how does it work?

A community built on sharing

"Airbnb began in 2008 when two designers who had space to share hosted three travelers looking for a place to stay. Now, millions of hosts and travelers choose to create a free Airbnb account so they can list their space and book unique accommodations anywhere in the world. And Airbnb experience hosts share their passions and interests with both travelers and locals."

Trusted services

"Airbnb helps make sharing easy, enjoyable, and safe. We verify personal profiles and listings, maintain a smart messaging system so hosts and guests can communicate with certainty, and manage a trusted platform to collect and transfer payments."

What Information Do We Have?

The source of the data is kaggle.com. The data used in this analysis covers two cities: Boston and Seattle.

The data files are calendar, reviews, and listings. The listings file contains one observation per listing, with information on: basics (location, space, host, images of the listing and host, availability), reviews, and price. The calendar and reviews files contain multiple entries per listing, covering day-by-day availability and individual reviews.

Questions

For anyone new to Airbnb (like me), the most obvious questions relate to:

Location

We can plot the locations of Airbnb listings for both Boston and Seattle. Surprisingly (to me), there are listings dotted all around each city.

We can aggregate the listings by zip code, to get a better sense of how many listings there are:
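The aggregation step can be sketched with pandas. This is only an illustration on a hypothetical mini-table: the real data comes from the Kaggle listings files, and the `city` and `zipcode` column names here are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle listings data;
# "city" and "zipcode" are assumed column names.
listings = pd.DataFrame({
    "city":    ["Boston", "Boston", "Boston", "Seattle", "Seattle"],
    "zipcode": ["02118", "02118", "02134", "98101", "98101"],
})

# Count listings per (city, zip code), busiest first.
counts = (listings.groupby(["city", "zipcode"])
                  .size()
                  .reset_index(name="n_listings")
                  .sort_values("n_listings", ascending=False))
print(counts)
```

The same `groupby` pattern drives all of the per-zip-code summaries below.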

Availability

Even though there is a surprisingly dispersed range of locations for each city, how often are those locations available?

So, for Boston, 50% of the listings are available for roughly half the year, whereas for Seattle, at least 50% of the listings are available almost year round.

Here is the breakdown of availability by city:
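A minimal sketch of how such an availability breakdown can be computed, using a toy calendar table (the Kaggle calendar files encode availability as "t"/"f"; the listing-to-city mapping here is invented for illustration):

```python
import pandas as pd

# Toy stand-in for the calendar data: one row per listing per day,
# with availability encoded as "t"/"f" as in the Kaggle files.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "available":  ["t", "t", "f", "t", "f", "f"],
})
city = pd.Series({1: "Boston", 2: "Seattle"}, name="city")

# Days available per listing, then the median per city.
days = (calendar.assign(avail=calendar["available"].eq("t"))
                .groupby("listing_id")["avail"].sum())
by_city = days.to_frame("days_available").join(city)
medians = by_city.groupby("city")["days_available"].median()
print(medians)
```

On the real data, the same `median` (or a full `describe`) per city yields the half-year versus near-year-round contrast noted above.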

Obviously, even accounting for availability, the range of locations available in both cities for most of the year (or almost all of the year, in the case of Seattle) is surprising.

Price

The breakdown of prices per zip code for each city is:

The prices in red are for zip codes where the price is at or below the 25th percentile.
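The red flagging can be sketched as follows, again on a hypothetical toy table (the real `price` column in listings.csv needs "$" and "," stripped first):

```python
import pandas as pd

# Hypothetical per-listing prices; the real data comes from
# listings.csv with a cleaned numeric "price" column.
listings = pd.DataFrame({
    "zipcode": ["02118", "02118", "02134", "02134", "02120"],
    "price":   [250.0, 230.0, 90.0, 110.0, 150.0],
})

# Median price per zip code, flagged when at or below the
# 25th percentile of those medians (shown in red in the plot).
median_price = listings.groupby("zipcode")["price"].median()
cutoff = median_price.quantile(0.25)
flags = median_price <= cutoff
print(median_price, cutoff)
```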

We see that, very roughly, the prices increase the closer to the city center the listing is (although this is not a hard and fast rule).

So Boston is roughly 50% more expensive than Seattle. Even then, $150 can be viewed as reasonable (about what a Udacity reviewer gets an hour ... for a full day's stay in a major US city).

This part would require further analysis: I have no information about the different zip codes in each city (or their hospitality to visitors). So I will leave this part of the analysis here, except to note that if you look at the ratings analysis below, grouped by zip code, there is little correspondence between low prices and low ratings.

Reviews & Ratings

Overview of ratings

The datasets contain a wealth of information regarding reviews of listings. There are numeric assessments, accuracy of numeric assessments, and the actual comments by users, to list just a few variables relating to listing reviews.

We'll begin by looking at the ratings given by users, per city, aggregated by zip code:

This is one surprising feature of this analysis: you would think that all of the features that make a listing attractive (location, amenities, the quality of the listing) would be reflected in both the rating and the price. In this dataset, that is not the case.

A service where 1 in 5 listings (Boston) or 1 in 4 (Seattle) has a rating of 100, and where 75% (Boston) or close to 90% (Seattle) are rated above 90, is doing something right. This is the standout feature of Airbnb for me. For a decentralized service to earn such outstanding ratings is very impressive!

How do prices vary with quality?

I now look at the percentiles of ratings, and get the average price for each percentile band.
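One way to do this is with `pd.qcut`, which buckets listings into equal-sized rating bands; a minimal sketch on invented numbers (the real columns would be `review_scores_rating` and a cleaned `price`):

```python
import pandas as pd

# Toy ratings and prices for illustration only.
df = pd.DataFrame({
    "rating": [70, 80, 85, 90, 95, 96, 98, 100],
    "price":  [80, 90, 100, 110, 130, 140, 160, 200],
})

# Bucket listings into rating quartiles, then average the
# price within each bucket.
df["rating_band"] = pd.qcut(df["rating"], q=4,
                            labels=["Q1", "Q2", "Q3", "Q4"])
mean_price = df.groupby("rating_band", observed=True)["price"].mean()
print(mean_price)
```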

As discussed earlier, Boston listings are more expensive than Seattle listings. Here we see, as expected, that the higher-rated listings typically have higher prices. What is unusual is that the highest-rated Seattle listings typically cost less than the lowest-rated listings in Boston.

An overview of reviews

As we saw, the reviews are mostly positive. We are going to analyze the reviews using natural language processing, scanning each review for key "sentiments" expressed in it.

We will do this in three ways.

Naive Examination

The results for the positive reviews are not too surprising; the results for the negative sentiments are shocking.
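The naive scan amounts to counting lexicon hits in each review. Here is a minimal sketch; the `POSITIVE` and `NEGATIVE` word lists are tiny hypothetical stand-ins for whatever lexicon the analysis actually used:

```python
from collections import Counter
import re

# Hypothetical mini-lexicons; the actual word lists are much larger.
POSITIVE = {"great", "clean", "lovely", "perfect"}
NEGATIVE = {"die", "rob", "killer", "war", "dirty"}

def scan_sentiments(reviews):
    """Naively count positive/negative lexicon hits across reviews."""
    pos, neg = Counter(), Counter()
    for text in reviews:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in POSITIVE:
                pos[word] += 1
            elif word in NEGATIVE:
                neg[word] += 1
    return pos, neg

pos, neg = scan_sentiments([
    "Great place, very clean!",
    "The view was to die for.",   # colloquial "die"
    "Rob was a lovely host.",     # "Rob" is a name
])
print(pos, neg)
```

As the toy examples already hint, a naive word match happily flags "to die for" and the host's name "Rob" as negative sentiments.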

Let's scan through some of the reviews to see where the words die, rob, killer, and war appear. It turns out we only need to look at one example of each to see why these apparently shocking terms are actually not that surprising.

Searching the comments for surprising terms

Die

Killer

Rob

War

So some of the reviews are not in English, and some of the "sentiments" detected are due to colloquial usage. First, let's remove the non-English reviews.
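A crude way to filter out non-English reviews is to require a few common English function words; this heuristic is only a sketch (a proper language-detection library would be more robust, and the method actually used in the analysis is not shown here):

```python
import re

# Treat a review as English if it contains enough common English
# function words. Purely a heuristic for illustration.
ENGLISH_HINTS = {"the", "and", "was", "is", "we", "a", "to", "of"}

def looks_english(text, min_hits=2):
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & ENGLISH_HINTS) >= min_hits

reviews = [
    "The apartment was great and the host was lovely.",
    "Die Wohnung war sehr zentral gelegen.",  # German: "the flat was very central"
]
english_only = [r for r in reviews if looks_english(r)]
```

Note the German example: "die" is simply the German definite article, one reason the word shows up so often before filtering.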

English Only Reviews

We no longer see "war". Surprisingly, however, "die" remains and "dead" becomes more prominent, while "killer" (which is mainly an English colloquialism) does not. Unsurprisingly, "rob" still remains. Let's see why "die" is used.

Searching the comments for surprising terms

OK, more colloquial usage. What about "killer"?

And finally, "dead":

Let's see if the colloquial usage is common across cities

It is interesting that there is very little difference in the words used to express sentiments for each city. The only notable difference is the prominence of the word "killer" in the Seattle sentiments (and the fact that someone called "Rob" obviously has more listings in Seattle).

Adding Stopwords

The next step is to remove the words that are deemed to have a negative sentiment but are used colloquially, or are actually names. That is, we will remove die, dead, killer, and rob:
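In code, this is just an extra stopword set applied to the sentiment counts before the word clouds are built; a minimal sketch (the counts below are invented):

```python
# Extra "stopwords": terms flagged as negative but used
# colloquially (or as names), dropped before the word clouds.
EXTRA_STOPWORDS = {"die", "dead", "killer", "rob"}

def drop_stopwords(sentiment_counts, stopwords=EXTRA_STOPWORDS):
    """Remove colloquial/name terms from a {word: count} mapping."""
    return {w: c for w, c in sentiment_counts.items()
            if w not in stopwords}

negative = {"die": 40, "rob": 25, "dirty": 12, "noisy": 9}
cleaned = drop_stopwords(negative)
print(cleaned)  # only the genuinely negative terms remain
```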

The last step is to reflect the actual proportion of sentiments used. That is, we know the ratings are excellent for most listings, so we would presume the sentiments expressed in reviews reflect this. So, let's make the area of each word cloud proportional to the number of sentiments found.

Word Clouds Proportional To Sentiments Expressed

This final visualization highlights the disparity between the types of sentiments expressed. That is, every time a negative sentiment was expressed, on average, 77 positive sentiments were expressed.
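For the proportional sizing, note that if each cloud's *area* is proportional to its sentiment count, the side length scales with the square root of the ratio; with the 77:1 ratio above (illustrative counts):

```python
import math

# Illustrative counts at the observed 77:1 ratio.
pos_count, neg_count = 7700, 100
ratio = pos_count / neg_count

# Area scales with count, so side length scales with sqrt(count).
side_scale = math.sqrt(ratio)
```

So the positive cloud is drawn with sides roughly 8.8x those of the negative cloud, not 77x.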

Conclusion

As far back as 2016, when these datasets were compiled, Airbnb was very active (at least in the cities we have data for).

We found some surprising results:

We found some entertaining results:

Obviously, this is just "the tip of the iceberg". For example, a fuller price analysis would involve gathering information about amenities, rental unit size, and so on. But hopefully this was an enjoyable read!